The Evolution of a Hierarchical Partitioning Algorithm for Large-Scale Scientific Data: Three Steps of Increasing Complexity
Abstract
As scientific data sets grow exponentially in size, the need for scalable algorithms that heuristically partition the data increases. In this paper, we describe the three-step evolution of a hierarchical partitioning algorithm for large-scale spatio-temporal scientific data sets generated by massive simulations. The first version of our algorithm uses a simple top-down partitioning technique, which divides the data by using a four-way bisection of the spatio-temporal space. The shortcomings of this algorithm lead to the second version of our partitioning algorithm, which uses a bottom-up approach. In this version, a partition hierarchy is constructed by systematically agglomerating the underlying Cartesian grid that is placed on the data. Finally, the third version of our algorithm utilizes the intrinsic topology of the data given in the original scientific problem to build the partition hierarchy in a bottom-up fashion. Specifically, the topology is used to heuristically agglomerate the data at each level of the partition hierarchy. Despite the growing complexity of our algorithms, the third version builds partition hierarchies in less time and can build trees for larger data sets than the previous two versions.

1. Three hierarchical partitioning algorithms

Scalable algorithms are needed to partition tera-scale data sets [1, 5]. This is especially true in scientific domains, where data set sizes have grown exponentially in recent years. We describe the evolution of a hierarchical partitioning algorithm for large-scale scientific data sets. Specifically, large-scale simulation programs produce our data sets in mesh format. A data set in mesh format consists of interconnected grids of small zones, in which data points are stored. Figure 1 depicts the mesh produced from an astrophysics simulation of a star in its mid-life.
Mesh data usually varies with time, consists of multiple dimensions (i.e., variables), and can contain irregular grids. Musick and Critchlow provide a good introduction to scientific mesh data [4].

Figure 1. A Mesh Data Set Representing a Star

The first and simplest version of our partitioning algorithm employs a top-down partitioning technique by performing a four-dimensional bisection on the spatio-temporal space. The major advantage of this approach is that it generates a global decomposition of the data. However, this global partitioning comes with three major drawbacks. First, it is computationally too expensive to scale well to terabyte data sets. This is largely due to its need to convert a mesh data file from its original simulation-specific format into a consistent vector-based representation. Second, it is not able to capture the information stored in the topology of a mesh data set. Lastly, the bisection procedure works best when there is a uniform density of grid cells throughout the whole problem domain. Typically, however, our domains have complex structures such as non-uniform distributions of grid cells, irregular boundaries, and unusual topologies. Figure 2 shows two examples of such domains.

Figure 2. Examples of Complex Domain Structures: (a) L-Shaped Domain; (b) Rectilinear Domain with Edges Glued Together

To address the above issues, our algorithm evolves to a bottom-up approach. First, however, we remove the time dimension from the partitioning space and redefine our partitions on the three-dimensional spatial structure of the data. This new partition space allows us to produce hierarchies that can easily be parallelized for data access.

The second version of our algorithm (called GRID) utilizes a grid-based bottom-up partitioning approach. GRID constructs a hierarchy by systematically agglomerating the underlying Cartesian grid that is placed on a mesh data set.
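As a rough illustration of grid-based bottom-up agglomeration, the sketch below merges occupied cells of a Cartesian grid into coarser cells by integer-dividing their indices, level by level, until one root cell remains. It is not the GRID implementation: the function names `coarsen` and `build_hierarchy`, the coarsening factor of 2 per axis, the restriction to non-negative integer indices, and the two-dimensional L-shaped example are all assumptions made for brevity.

```python
from collections import defaultdict


def coarsen(level, factor=2):
    """Merge the cells of one level into parent cells one level up.

    `level` maps a cell index (tuple of ints) to the original grid cells it covers.
    """
    coarse = defaultdict(list)
    for index, members in level.items():
        # Hypothetical coarsening rule: parent index = child index // factor.
        parent = tuple(i // factor for i in index)
        coarse[parent].extend(members)
    return dict(coarse)


def build_hierarchy(occupied_cells, factor=2):
    """Return the levels from finest to coarsest, ending at a single root cell."""
    cells = list(occupied_cells)
    if factor < 2 or any(i < 0 for cell in cells for i in cell):
        raise ValueError("expects factor >= 2 and non-negative cell indices")
    level = {cell: [cell] for cell in cells}
    levels = [level]
    while len(level) > 1:
        level = coarsen(level, factor)
        levels.append(level)
    return levels


# An L-shaped domain: a 4 x 4 grid with its upper-right 2 x 2 quadrant missing.
l_shape = [(x, y) for x in range(4) for y in range(4) if not (x >= 2 and y >= 2)]
levels = build_hierarchy(l_shape)
print([len(level) for level in levels])  # prints [12, 3, 1]
```

Only occupied cells are stored, so an irregular boundary simply produces fewer coarse cells at the next level rather than requiring special handling.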
Specifically, a simple coarsening strategy starts at the initial grid configuration and iteratively produces coarse-level collections of cells from fine-level collections of grid cells. Unlike our top-down approach, GRID scales well to large data sets, deals effectively with irregularities of the grid, and produces hierarchies with better structure than the top-down algorithm. However, it is still not able to capture the topological information (i.e., the true physical relationships of the grid cells) of a mesh data set.

The third version of our algorithm, called TOPOLOGY, improves on the previous bottom-up approach by utilizing the intrinsic topology of the data given in the original scientific problem to build the partition hierarchy. TOPOLOGY uses a two-pass approach. In the first pass, each coarse cell is assigned the “best” neighborhood configuration (with respect to its rectilinear cell shape). This operation is a local search on the 2^N possible neighborhood configurations of a coarse cell, where N is the number of dimensions. For instance, in two dimensions, the four possible locations for a given cell (within a coarse agglomeration) are denoted by the grey boxes in Figure 3.

Figure 3. Four Possible Locations for a Cell within a Coarse Agglomeration in Two Dimensions

Since the first pass of TOPOLOGY is a local operation on cells, no information about the past and future agglomerations in other regions of the domain is taken into consideration when creating ancestor-descendant relationships. For this reason, some coarse agglomerations can result in trees that are not binary trees, quadtrees, or octrees. For instance, after the first pass the coarse cells can easily end up arranged as shown in Figure 4.

Figure 4. A Non-Quad Tree Coarse Cell Arrangement

The coarse cells (C1, C2, and C3, given by solid lines) have been arranged in such a way that the neighbors of the coarse cells are indeterminate.
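The local search of the first pass can be sketched by enumerating the 2^N positions a fine cell may occupy inside a coarse block that spans two cells per axis. The names `candidate_blocks` and `best_block` are hypothetical, and scoring a block by how many of its cells are present in the domain is only an assumed stand-in for the "best" criterion, which is not specified here.

```python
from itertools import product


def candidate_blocks(cell):
    """Yield the 2**N coarse blocks (two cells per axis) that contain `cell`."""
    corners = list(product((0, 1), repeat=len(cell)))
    for offset in corners:
        # `offset` is the position of `cell` inside the candidate block.
        origin = tuple(c - o for c, o in zip(cell, offset))
        yield [tuple(a + d for a, d in zip(origin, delta)) for delta in corners]


def best_block(cell, available):
    """Pick the candidate block with the most cells present in `available`.

    Completeness scoring is an assumption, not the criterion used by TOPOLOGY.
    """
    return max(candidate_blocks(cell),
               key=lambda block: sum(c in available for c in block))


domain = {(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)}
print(len(list(candidate_blocks((1, 0)))))  # prints 4
print(sorted(best_block((1, 0), domain)))   # prints [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Because each cell is scored in isolation, neighboring cells can choose blocks that do not line up with one another, which is the kind of arrangement Figure 4 shows.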
For example, C2 has two neighbors to its right. The second pass corrects such structural problems associated with indeterminate neighbors of coarse cells. In particular, the second pass consists of N subphases. Each subphase, s, corrects the (N - s)-dimensional structures (planes, lines, and points). Each subphase uses information from all the previous subphases to correctly place the coarse cells. It is important to note that in the second pass, only neighbor relations are adjusted and not the coarse cells (which were defined in the first pass). For example, in two dimensions, the problem illustrated in Figure 4 can be fixed by (i) adjusting the face neighbors so that cell C2 “slides” down half of a coarse grid cell and (ii) making sure the neighbors for all local coarse cells reflect this slide (see Figure 5).

Figure 5. A Fix for a Non-Quad Tree Coarse Cell Arrangement

A heuristically complex procedure is used to compute these corrections. Our correction procedure utilizes the information about the face, edge, and corner neighbors of the coarse cells’ descendants to establish neighbors at the coarse level. For instance, to find the neighbors for C2 (shown in Figure 4), we utilize the information for neighbors of cells 1, 2, 5, 6, 9, 10, 11, and 12.

In our topology-based algorithm, a new coarse level is created in the first pass and neighbors of coarse cells are identified in the second pass. The second pass rearranges the grid somewhat. The degree to which the domain of coarse cells is rearranged is bounded by fine-cell-sized moves. The degree to which a coarse level “fits” a fine level can be measured by the number of ancestor-descendant relationships that are established versus the number that could be established. This measure, ...
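One plausible reading of the fit measure is a simple ratio of established to possible ancestor-descendant relationships. The sketch below computes that ratio under two assumptions that do not come from the text: that "possible" means every coarse cell could hold 2^N fine cells, and that the relationships are available as a mapping from coarse cell to its assigned fine cells. The name `fit_ratio` and the sample labels are hypothetical.

```python
def fit_ratio(children_of, n_dims):
    """Fraction of possible ancestor-descendant links that are established.

    `children_of` maps each coarse cell to the fine cells assigned to it.
    Assumes every coarse cell could hold 2**n_dims fine cells.
    """
    possible = len(children_of) * 2 ** n_dims
    if possible == 0:
        return 0.0
    established = sum(len(children) for children in children_of.values())
    return established / possible


# Two coarse cells in two dimensions: one full (4 of 4), one half full (2 of 4).
print(fit_ratio({"A": [1, 2, 3, 4], "B": [5, 6]}, n_dims=2))  # prints 0.75
```

A value of 1.0 would mean every coarse cell received its full complement of descendants; lower values indicate a looser fit between the two levels.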